-
Notifications
You must be signed in to change notification settings - Fork 36
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reduce memory consumption of SingleCells.merge_single_cells #234
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! Please address my comments (focusing on the testing comments)
Also, would reducing to float32
here help a lot as well? This is within scope of the name of your PR
"Metadata_ObjectNumber_cells", | ||
] | ||
# check that we have the same data using same cols, sort and a reset index | ||
pd.testing.assert_frame_equal( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this doesn't looked blacked to me, please confirm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I ran black on this, and --check
doesn't seem to flag it (may be profile related). I'm seeing that there's a list end bumping up against the comment, I'll add in a newline there with new changes. Are you seeing anything else that looks stylistically out of place which I may change?
@@ -493,8 +517,8 @@ def test_merge_single_cells_cytominer_database_test_file(): | |||
# note: pd.DataFrame datatypes sometimes appear automatically changed on-read, so we cast | |||
# the result_file dataframe using the base dataframe's types. | |||
pd.testing.assert_frame_equal( | |||
pd.read_csv(csv_path).astype(merged_sc.dtypes.to_dict()), | |||
pd.read_csv(result_file).astype(merged_sc.dtypes.to_dict()), | |||
pd.read_csv(csv_path).astype(merged_sc.dtypes.to_dict())[merged_sc.columns], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this necessary now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for calling this out! For inner merges, Pandas "...preserve[s] the order of the left keys." (reference: pd.merge(..., how)). The column specification here is provided to account for the now differently ordered columns due to the "right" and "left" being swapped within the compartment merges. I also felt that accounting for exact column ordering over time may be unwieldy, so this may provide benefit towards development velocity. That said, we may need a change; do you think we should enforce strict column order within merge_single_cells
to remove the need for specification here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gwaybio - just wanted to follow up here with some thoughts about merge_single_cells
column order specification. If we make explicit column sorting a part of the method, we should be able to reduce the variability of data produced. Note: tests may still require column filtering (for example, if certain data do not exist in one dataframe vs another).
I wrote the following as an example of what we could do here:
import pandas as pd
# create an example dataframe with mixed order columns
df = pd.DataFrame(
{
"Image_Metadata_1": [0],
"TableNumber": [0],
"ImageNumber": [0],
"Cytoplasm_Data_1": [0],
"Nuclei_Data_1": [0],
"Cells_Data_1": [0],
"Mito_Data_1": [0],
"Actin_Data_1": [0],
"Image_Data_1": [0],
}
)
# print df columns as a python list, representing new order
print("Initial order:")
print(df.columns.tolist())
def custom_sort(value: str):
"""
A custom sort for Pycytominer merge_single_cells
pd.dataframe columns
"""
# lowercase str which will be used for comparisons
# to avoid any capitalization challenges
value_lower = value.lower()
# first sorted values (by list index)
sort_first = ["tablenumber", "imagenumber"]
# middle sort value
sort_middle = "metadata"
# sorted last (by list order enumeration)
sort_later = [
"cells",
"cytoplasm",
"nuclei",
"image",
]
# if value is in the sort_first list
# return the index from that list
if value_lower in sort_first:
return sort_first.index(value_lower)
# if sort_middle is anywhere in value return
# next index value after sort_first values
elif sort_middle in value_lower:
return len(sort_first)
# if any sort_later are found as the first part of value
# return enumerated index of sort_later value (starting from
# relative len based on the above conditionals and lists)
elif any(value_lower.startswith(val) for val in sort_later):
for k, v in enumerate(sort_later, start=len(sort_first) + 1):
if value_lower.startswith(v):
return k
# else we return the total length of all sort values
return len(sort_first) + len(sort_later) + 1
# inner sorted alphabetizes any columns which may not be part of custom_sort
# outer sort provides pycytominer-specific column sort order
df = df[sorted(sorted(df.columns), key=custom_sort)]
# print df columns as a python list, representing new order
print("\nSorted order:")
print(df.columns.tolist())
Which has printed output:
Initial order:
['Image_Metadata_1', 'TableNumber', 'ImageNumber', 'Cytoplasm_Data_1', 'Nuclei_Data_1', 'Cells_Data_1', 'Mito_Data_1', 'Actin_Data_1', 'Image_Data_1']
Sorted order:
['TableNumber', 'ImageNumber', 'Image_Metadata_1', 'Cells_Data_1', 'Cytoplasm_Data_1', 'Nuclei_Data_1', 'Image_Data_1', 'Actin_Data_1', 'Mito_Data_1']
# check that we have the same data using same cols, sort and a reset index | ||
pd.testing.assert_frame_equal( | ||
left=manual_merge[assert_cols] | ||
.sort_values(by=assert_cols, ascending=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am a bit concerned that sorting is giving us undue confidence of equivalent results. Am I reading this wrong? How should I be thinking about this?
Can you add an explicit check that the object number is the same? Oh, but maybe this code block is doing exactly that (and more!)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great comment, thank you! The chunked merges made the data inconsistent with the existing tests. The check here now looks at columns "Metadata_ObjectNumber", "Metadata_ObjectNumber_cytoplasm", "Metadata_ObjectNumber_cells", sorted by the same columns in the same order for data equivalency and does include an Metadata_ObjectNumber check as a result.
Would users of merge_single_cells
expect for sorted output by the same columns or similar? We could include this as an additional step within the method, which would likely avoid needing to do that manually within testing here.
Co-authored-by: Gregory Way <[email protected]>
Thank you @gwaybio for the comments and feedback! I added comment + changes for individual thoughts and wanted to follow up here to your mention about |
Yes, good idea to make an optional/experimentally-untested argument
…On Fri, Oct 7, 2022, 2:01 PM Dave Bunten ***@***.***> wrote:
Thank you @gwaybio <https://github.com/gwaybio> for the comments and
feedback! I added comment + changes for individual thoughts and wanted to
follow up here to your mention about float32. Converting from float64 (a
default which I believe cascades from numpy to pandas) to float32 would
very likely reduce the memory footprint of this method. Because this could
mean data loss under some circumstances, would it be beneficial to make
this an additional argument to allow development options?
—
Reply to this email directly, view it on GitHub
<#234 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ABYADBI5XBB6TYVQG3PBVXLWCB6QRANCNFSM6AAAAAAQ7A5MYU>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
@gwaybio and @axiomcura - I made some changes to allow optional float_dtype specification for merge_single_cells. The argument, |
@d33bs - sorry to have lost track of this PR. Can you remind me where we are with this? is it ready to merge? |
Thanks @gwaybio and no worries! I feel that this PR should be closed without a merge, with justifications and detailed thoughts below. Very open to other ideas and discussion here! 🙂 Justifications on Closing without Merge:
Forward momentum:
What are your thoughts? Should we move forward with any of the above as outlined? Please let me know if you have questions/suggestions/comments throughout. |
Thanks @d33bs for catching me up!
No. I feel that because of limited bandwidth, all of our efforts should be placed toward pycytominer-transform at the moment. I feel that once we have transform, we can start to deprecate SingleCells functionality, which will invariably make these short term patches a waste of time. However, if you sense that any of these patches might be really quick, then let's do them! There is an unfortunate time trade-off here, and we need to be really thoughtful about were to allocate attention. I'll close now and we'll use a short part of our meeting this morning to discuss any of these short-term pycytominer targets. Thanks again! |
Description
These changes seek to address #233 memory challenges when using
SingleCells.merge_single_cells
by segmenting Pandas merges using dynamic argumentchunksize
and related dataframe concatenation within the method. In brief testing using small datasets with these changes I saw around 12-18% reduction in peak memory use (usingchunksize
default).Using this segmented merge approach we're making a tradeoff between memory and time; in large data scenarios we may avoid out-of-memory issues and the procedure may take longer due to more (but smaller) merges being required.
chunksize
is an optional argument which when left to it's default ofNone
will automatically be set to 1/3rd of the first compartment dataframe rowlength. Performance may vary withchunksize
contingent on the dataset and is the reasoning for this being an argument which may be optionally set by a developer.Caveats:
FutureWarning: Passing 'suffixes' which cause duplicate columns {'ObjectNumber_cytoplasm'} in the result is deprecated and will raise a MergeError in a future version.
).test_cells.py
needed updates in order for the changes withincells.py
to pass pytesting. With these changes I focused on ensuring the data itself was as expected. I'd ask for double-checks here especially to make sure everything looks good.chunksize
default does not guarantee avoidance of out-of-memory issues and instead provides flexibility for developer implementation.Thank you @axiomcura for the excellent writeup on the performance issue!
What is the nature of your change?
Checklist
Please ensure that all boxes are checked before indicating that a pull request is ready for review.